ftp.cs.arizona.edu


Return-Path: <icon-group-sender>
Received: from kingfisher.CS.Arizona.EDU (kingfisher.CS.Arizona.EDU [192.12.69.239])
	by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) with SMTP id QAA10662
	for <icon-group-addresses@baskerville.CS.Arizona.EDU>; Wed, 16 Sep 1998 16:46:55 -0700 (MST)
Received: by kingfisher.CS.Arizona.EDU (5.65v4.0/1.1.8.2/08Nov94-0446PM)
	id AA05606; Wed, 16 Sep 1998 16:46:26 -0700
To: icon-group@optima.CS.Arizona.EDU
Date: 16 Sep 1998 15:27:22 -0400
From: richard@goon.stg.brown.edu (Richard L. Goerwitz III)
Message-Id: <6tp3eq$buu@goon.stg.brown.edu>
Organization: Brown University Scholarly Technology Group
Sender: icon-group-request@optima.CS.Arizona.EDU
References: <9809120038.AA13184@hawk.CS.Arizona.EDU>
Subject: Re: Unicode support or support for non-ASCII based character ma
Errors-To: icon-group-errors@optima.CS.Arizona.EDU
Status: RO

Re Unicode, Gregg Townsend <gmt@baskerville.CS.Arizona.EDU> wrote:

> -- In Unicode, there aren't just 26 lower-case letters, and
>    they're not all contiguous.  What should &lcase contain?
>    How would this affect existing programs?

It's funny that this issue has raised its ugly head again.  I guess
it was four or five years ago that I advocated moving to Unicode
(the logic was that Icon was a good string processing language, and
that it was kind of silly to confine it to an eight-bit universe).

The problem with doing this back then was one of resources (the
Icon Project was winding down, and graphics had become the main
concern).  Users had some good spats over Unicode, but ultimately
nobody had any resources to bring it off - and some prominent
members of the Icon community denied that Unicode would ever become
a prominent standard (aside: sixteen-bit characters are now the
default for NT; Java also works on the same assumption; and Unicode
is the core format for XML, too).

Anyway, aside from me annoying everyone with my Unicode "mantra"
(as Clint called it once), the issue pretty much went away.

To the questions you raised (re lowercase letters, effect on existing
programs):

The number of alphabets with case distinctions is finite, and in
fact rather low (Latin, Greek, Coptic, Armenian, and Cyrillic).  So
you define everything (except specifically uppercase letters in
these scripts) to be lowercase.  Then you define everything (except
specifically lowercase letters in these scripts) to be uppercase.
Note that in some languages the number of lowercase letters exceeds
the number of uppercase letters (and the mappings are not
one-to-one).  But in most cases, reasonable equivalents can be
conjured up.  I'd be happy to contribute code, if it comes to that.

As for how this would affect existing programs:  Heavily.  The
assumption of Unicode characters really screws up everything.  Not
only do you have to worry about upper/lowercase mapping, but you
also have to think about (as GT notes above) I/O.  Some ideas:

  0) data and character streams should be read using different
     functions

  1) user should be permitted to select a (default) input format
     (e.g., ISO 8859-1, UTF-8, UTF-16, etc.) for character stream
     readers

  2) user should be permitted to select a (default) output format
     (e.g., ISO 8859-1, UTF-8, UTF-16, etc.) for character stream
     writers

  3) all internal strings must be represented as UCS-2 (or UCS-4,
     if you don't care about memory); you can't use UTF-8, because
     multi-byte variable-length characters are no good for any code
     that relies on fixed-width characters

For most programs, the user would just select ISO 8859 as the
default character input and output format.

The whole thing is a mess, and my guess is that unless there is a
new source of funding for Icon - and an effort to put in a ground-up
rewrite - it would not be feasible to rearrange everything to
support Unicode.  If it does happen, I suspect that it will happen
in a descendant language that incorporates fundamental features like
networking.
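
[Editorial illustration, not part of the original message.]  The I/O
ideas numbered 0 to 3 above can be sketched roughly as follows.  This
is Python rather than Icon, and every function name here
(read_data, read_chars, write_chars, utf8_widths, utf16_widths) and
the DEFAULT_ENCODING constant are hypothetical, chosen only for
illustration.  Python has no "UCS-2" codec, so UTF-16 (big-endian,
no byte-order mark) stands in for it; the two agree for characters
in the Basic Multilingual Plane.

```python
import io

# Hypothetical default, following the remark that most programs
# would simply select ISO 8859 for input and output.
DEFAULT_ENCODING = "iso-8859-1"

def read_data(stream):
    """Idea 0: raw data is read as bytes, with no decoding at all."""
    return stream.read()

def read_chars(stream, encoding=DEFAULT_ENCODING):
    """Ideas 0 and 1: character input goes through a separate
    function that decodes from a user-selectable input format."""
    return stream.read().decode(encoding)

def write_chars(stream, text, encoding=DEFAULT_ENCODING):
    """Idea 2: character output is encoded to a user-selectable
    output format before being written."""
    stream.write(text.encode(encoding))

def utf8_widths(text):
    """Idea 3: bytes per character in UTF-8 (varies by character)."""
    return [len(ch.encode("utf-8")) for ch in text]

def utf16_widths(text):
    """Idea 3: bytes per character in UTF-16 without a BOM
    (constant for characters in the Basic Multilingual Plane)."""
    return [len(ch.encode("utf-16-be")) for ch in text]
```

The same bytes mean different things depending on which reader is
used: read_data returns them untouched, while read_chars interprets
them according to the selected format.  The two width helpers show
why a variable-length encoding is awkward as an internal
representation for code that indexes strings by fixed-size units.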

Two years ago or so I saw some excellent networking extensions to
Icon posted here, incidentally.  Seemed to me they gave Icon a
fighting chance.  Whatever became of them?

One other thing to think about:  What will the PERL community do on
the Unicode/XML front?  Keep your eyes peeled.

-- 

Richard Goerwitz
PGP key fingerprint: C1 3E F4 23 7C 33 51 8D 3B 88 53 57 56 0D 38 A0
For more info (mail, phone, fax no.): finger richard@goon.stg.brown.edu